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PARALLEL DIRECTIONALLY SPLIT SOLVER BASED ON REFORMULATION OF 

PIPELINED THOMAS ALGORITHM 

A. POVITSKY * 


Abstract. A very efficient direct solver, known as the Thomas algorithm, is frequently used for the 
solution of band matrix systems that typically appear in models of mechanics. The processor idle time is a 
reason for the poor parallelization efficiency of the solvers based on the pipelined parallel Thomas algorithms. 

In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The 
proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the 
backward step computations immediately after the completion of the forward step computations for the first 
portion of lines. This algorithm has data available for other computational tasks while processors are idle 
from the Thomas algorithm. 

The proposed 3-D directionally split solver is based on the static scheduling of processors where local and 
non-local, data- dependent and data-independent computations are scheduled while processors are idle. A 
theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show 
an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. 

It is shown by computational experiments and by the theoretical model that the proposed algorithm 
reduces the parallelization penalty about two times over the basic algorithm for the range of the number of 
processors (subdomains) considered and the number of grid nodes per subdomain. 

Key words, parallel computing, parallelization model, directionally split methods, pipelined Thomas 
Algorithm, banded matrices, ADI and FS methods 

Subject classification. Computer Science 

1. Introduction. Efficient solution of directionally split banded matrix systems is essential to multi- 
grid, compact, and implicit solvers. When the implicit schemes are applied to multi- dimensional problems, 
the operators are separated into one-dimensional components and the scheme is split into two (for 2-D 
problems) or three (for 3-D problems) steps, each one involving only the implicit operations originating 
from a single coordinate [1]. These numerical algorithms are denoted as fractional step (FS) or alternating 
direction implicit (ADI) methods. 

According to D. Caughey, the development of implicit CFD algorithms suitable for distributed memory 
machines will become increasingly important in the future. The parallelization of implicit schemes will require 
more ingenuity if they are to remain competitive with parallel explicit schemes. The interplay between the 
data structures, the memory assignment among processors, and its access by the flow solver will require a 
truly interdisciplinary approach to produce effective algorithms [2]. 

For block-banded systems the LU decomposition method leads to an efficient serial direct algorithm 
known as the Thomas algorithm. Parallelization of implicit solvers that use the Thomas algorithm for 
the solution of banded system of equations is hindered by global data dependencies. Parallel versions of 
the Thomas algorithm are of the pipelined type [3]. Pipelines occur due to the recurrence of data at the 
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forward and the backward step computations of the Thomas algorithm. During the pipelined process the first 
processor has to wait for the completion of the forward step computations and completion of the backward 
step computations for the first group of lines on all processors ahead. Thus, the first processor becomes idle 
at the switch from the forward to the backward step of the Thomas algorithm. In turn, the last processor 
has to wait for available data at the beginning of each spatial step. 

The use of the pipelined Thomas algorithm (PTA) results in communication between neighboring pro- 
cessors due to the transfer of coefficients on the forward step and transfer of solution on the backward step. 
One may use a separate message after completion of the forward or backward step for a single line, however, 
this results in the maximum communication time latency overhead. On the other hand, reducing the latency 
overhead by sending data from more than one line in each message increases the data dependency delay 
effect and processor idle time. This latency and data dependency delay tradeoff determines the optimum 
number of lines to be completed before a message is sent to the neighboring processor [4], [5]. V. Naik et al 
[4] defined this optimum number of lines and used it in their parallelization of an implicit finite-difference 
code for solving Euler and thin-layer Navier-Stokes equations. Still, there is the processor idle time between 
the forward and the backward step computations and global synchronization of processors at each spatial 
step of the ADI. This parallel implicit directionally split algorithm is referred to as the basic algorithm in 
this study. 

In our recent study [6], we formulated a new version of the Thomas parallel algorithm named the Immedi- 
ate Backward Pipelined Thomas Algorithm (IB-PTA). The backward step computations begin immediately 
after the forward step computations have been completed for the first portion of lines. Some lines have 
been rendered by the Thomas algorithm before processors become idle. Although the IB-PTA cannot reduce 
the processor idle time, there are data available for other computational tasks while processors are idle. A 
multi-dimensional numerical algorithm using the IB-PTA in each spatial direction can run on processors in 
the time-staggered manner without a global synchronization (the first processor finishes its computations 
first at each time step). The development of such algorithm is addressed in this study. 

Various computational tasks are examined for execution while processors are idle from the Thomas 
algorithm in the current direction. These tasks may include: (i) computation of the left-hand side coefficients 
for the next spatial stage, including computation of damping functions and non-linear TVD evaluations; 
(ii) computation of the right-hand side coefficients, for example , multiplications of intermediate FS or ADI 
functions by transformation metrics of curvilinear grids; and (iii) execution of the forward step computations 
of the Thomas algorithm in the next spatial direction. These tasks are classified here as either data-dependent 
(using the last spatial step solution) or data-independent. 

We develop a theoretical model to estimate a parallelization efficiency of the proposed algorithm. Such 
models are recognized as a quantitative basis for the design of parallel algorithms. We base our model on 
the idealized multicomputer parallel architecture model [7]. Iow-level hardware details such as memory 
hierarchies and the topology of the interconnection network ar ; not introduced in the model. This model 
is used here to define the optimal number of solved fines per message, to provide asymptotic analysis and 
to estimate of parallelization efficiency for a large number of processors which are not available yet, and to 
compare the proposed algorithm with the basic one. First, th;s model is used for a cubic global domain 
(equal number of grid nodes in all directions). Then an optimal partitioning for a global domain with 
unequal number of grid nodes in different directions is obtained based on this model. To get a unified 
approach for various MIMD computers, results are presented in terms of ratios between communication 
latency and transfer times to the backward step computation time per grid node. These ratios were used in 
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studies [8] and [9] for the complexity analysis of various parallel algorithms for numerical solutions of partial 
differential equations (PDE). 

As the development of a computer program by the proposed algorithm may represent certain difficulties, 
we design a parallel computer code using modular design techniques [7] and present it here. We generate 
static schedule of processors for the IB-PTA in a single direction by methods described in our study [6] and 
fashion schedules of the pipelined Thomas algorithm in different directions together with local computational 
tasks. Performing core computations of the algorithm, processors are governed by this static schedule. 
Computations are separated from communication between processors. One can switch to other type of 
banded matrix system to be solved, method of computing discretized coefficients of PDE, type of directionally 
split method, other schedule of processors, or type of message-passing library changing only corresponding 
modules of the code. 

This paper is composed of the following sections. In section 2, the parallelization method is described. 
In section 3, the theoretical model of parallelization efficiency is developed for 2-D and 3-D computational 
domains. In section 4, the parallelization method and the theoretical model of the parallelization efficiency 
are tested and verified by a numerical solution of a benchmark problem on CRAY T3E and IBM SP MIMD 
computers. 

2. Formulation of the algorithm. 

2.1. Mathematical formulation. Consider a non-linear partial differential equation 
(1) ~ =S{U)U + Q 

where t is the time, S(U) = S X (U) + S y (U) + S Z {U) is a spatial differential operator and Q is a source term. 
As an example of a directionally split method we consider the fractional step (FS) method. The method is 
based on a factorization of the Crank- Nicholson scheme, where the factorized scheme is solved in three steps 
as a succession of one-dimensional Crank-Nicholson schemes: 

(1 - 0.5A tS x )U^ = (1 + 0.5A tS x )U n + A tQ 


(1 - 0.5A tS y )U n + l = (1 + 0.5A tS y )U n ^ 


(2) (1 - 0.5A tS z )U n+l = (1 + 0.5A tS z )U n ^, 

where S x ,S y and S z are linear finite-difference operators in the x, y and z directions, respectively, and 
L rn+1 ,£/ n + 1 are intermediate FS functions. 

Second-order finite-difference formulation of this system leads to band tri-diagonal system of linear 
equations 

(3) a i,j,kUi—l,j,k 4“ = fi,j,ki 

where 2 , j, k are spatial grid nodes, coefficients bij^ and Cij^ are functions of U n and/or time, f iyj ^ 

is the r.h.s. of equations (2). The above system of linear equations corresponds to the first spatial step of 
Eqs. (2). The similar banded linearized systems must be solved for the second and the third spatial steps of 
the FS. 
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This system of A r3 equations is considered as N 2 systems of N equations where each system of N 
equations corresponds to j, k = const. Therefore, the scalar tridiagonal version of the Thomas algorithm is 
used to solve each system of N equations. The forward step of the Thomas algorithm for this system is 



^1 

_ t ,j*k 

,k Oi,j,k ^i,j,k j > 

d i- J,k 

i Ntotx 

9l f jyk = 
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where N to tx is the number of grid nodes in the x direction. The backward step of the Thomas algorithm is 
( 5 ) Us,j,k = 9N,j,k, Ui,j,k= 9ij,k -U i+ i <jtk !j^, k = N totx — 1, ..., 1 

The solution of this banded system of N equations is also denoted as the rendering of the line (j, k). 

2-2. Pipelined Thomas algorithms. To parallelize the Thomas algorithm, the coefficients of Eq. 
(3) are mapped into processors so that the subset {a t , Jifc , b t]<k , Ci tjtk \ i = N(I — 1) + 1, j = 

N(J — 1) + 1, ... ,NJ , k = N(K — 1) + 1, ..., NK } belongs to tfie (I,J,K) th processor. The computational 
domain is divided to Nj x Nj x N K subdomains with N x N x N grid nodes each one. 

The Pipelined Thomas Algorithm (PTA) [4] is cited shortly. The (J, J, K) th processor receives coefficients 
9N(i-i),j,k from the (I - 1 ,J,K) th processor; computes coefficients dij tk and gi.j,k- where l = 
N(I - 1) + 1, ...,NI of the forward step of the Thomas algorithm, sends coefficients d(NI.j, k), g(NI,j,k) 
to the (7 + 1, J, K) th processor and repeats computations (4) until the forward step computations for the 
N 2 lines are completed. After completion of all the forward step computations specific to a single processor, 
the ( I,J,K) th processor has to wait for the completion of the forward step by all processors ahead. The 
backward step computations (5) are performed in the similar manner as the forward step computations but 
in the decreasing direction. The lines (j, k) are gathered in groups and the number of lines is solved per 
message. Processors are idle between the forward and the backward steps of the Thomas algorithm and 
there are no data available for other computational tasks by this time. 

The Immediate Backward Thomas Pipeline Algorithm (IB-PTA) has been developed in our study [6] 
and is described here briefly. First, groups of lines are rendered by the forward step computations (see above) 
till the first group of lines is completed on the last processor. Then the backward step computations for each 
group of lines begin immediately after the completion of the forward step computations for these lines. Each 
processor switches between the forward and backward steps of the Thomas algorithm and communicates with 
its neighbors to receive necessary data for beginning of either the forward or the backward computations for 
the next portion of lines. Finally, remaining lines are rendered by the backward step computations and there 
are no available lines for the forward step computations. It was shown [6] that the idle time is the same for 
the IB-PTA and the basic PTA when these algorithms are used in a single direction. The advantage of the 
IB-PTA is that processors become idle after completion of subset lines. At this time processors can be used 
for other computational tasks requiring these data, as described in the following subsection. 

2.3. Schedule of processors for directionally split problem. The basic algorithm is executed 
in the y and z directions the same way as in the x direction. After completion of all Thomas algorithm 
computations in the last spatial direction, processors compute S(U) operators. Communications control 
computational tasks as either the forward step coefficients or the backward step solution must be obtained 
from the neighboring processors for the beginning of either the 1 arward or the backward step computations, 
and there are no other computational tasks while processors wa t for these data. 
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This is no longer the case for the proposed algorithm, because processors execute other computational 
tasks while these processors are idle. Additionally, these tasks might be manifold, and the idle processor 
times are different for the different processors. Therefore, the static scheduling of processors is adopted in 
this study, i.e., the communication and computations schedule of processors is computed before numerical 
computations are executed. 

We define a time unit as a time interval when a processor either renders a group of lines by one type of 
computations or is idle. Processors may communicate only at the beginning of the time unit. The length of 
a time unit is the same for all processors. This schedule of processors includes the order of computational 
tasks on each processor and the order of communication with its neighbors. 

Types of computations in this study include non-local (the forward step computations and the back- 
ward step computations of the Thomas Algorithm) and local (computations of spatial operators S(U) or 
computations of the right hand side (r.h.s.) of equations (2)). Most of considered computational tasks, 
namely, all the Thomas algorithm computations, local computations of the r.h.s., and local computations of 
spatial operators at the end of a time step use results of the Thomas algorithm computations in the last ren- 
dered direction, i.e., these tasks are data-dependent. However, computations of the S y and S z operators are 
data-independent from the results of the Thomas algorithm computations in the directions x and y . These 
operators depend upon solution U n and do not depend upon intermediate functions J7 n+1 and [7 n+1 . Both 
data-dependent and data-independent tasks may be executed while processors are idle from the Thomas 
algorithm. However, the practical realization is different for data-dependent, data-independent, local and 
non-local tasks. Additionally, each type of computations has a different computational time per grid node; 
therefore, the number of rendered lines per time unit are different for various types of computations (see the 
next section about calculation of these numbers). 

Computations are governed by an integer- valued variable: 


( 6 ) 




f 

4 local computations 
+/ forward step computations 
0 processor is idle 
—l backward step computations 


where p is the processor number, i is the number of time unit, l = 1, 2,3 corresponds to the directions x, y 
and z, respectively. 

A scheduling algorithm for PTAs in a single direction has been developed in our study [6] . The processor 
schedule for a subset of Nk processors {(/, J, 1 ),..., (J, J, Nk)} that form a pipeline for the Thomas algorithm 
computations in the z direction is shown in Fig. 1. It is assumed, that the computational time per grid node 
is equal for the forward and the local computations and it is 1.5 times greater than that for the backward 
step computations. Therefore, the number of portions of lines for the forward step computations is equal to 
that for the local computations, whereas the number of portions of lines for the backward step computations 
is 1.5 times less. The column of values J(AT, i) corresponds to the K th processor (from 1 to Nk)- Columns 
are shifted so as each horizontal line corresponds to a single time moment in terms of wall clock. Arrows 

>, < and < > denote send, receive and send- receive communications between neighboring 

processors in this processor pipeline. 

The processor schedule corresponds to the execution of the Thomas algorithm in the direction z and the 
local, data-dependent computations of the S x (U n ) operator for the next time step. The proposed algorithm 
is based on the IB-PTA and uses the processor idle time for the computations of the operator S x . Processors 
run the proposed algorithm in a time-staggered way so that the first outermost processor (/, J, 1) completes 
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its computations first and processors do not become idle (Fig. la). For the basic algorithm, the operator S x 
can be computed only after the completion of the PTA for all lines (Fig. lb). The first outermost processor 
completes computations last due to the idle time between the forward and the backward steps (Fig. lb). 
Thus, Fig. 1 illustrates the main advantage of the proposed algorithm over the basic one. 

The other schedule of processors includes the forward step computations of the Thomas algorithm in 
the next direction while processors are idle from the Thomas algorithm in the current direction. To make it 
feasible, the grid nodes rendered by the Thomas algorithm in the current direction must form a contiguous 
extend in the next direction. 

To execute the Thomas algorithm in the x direction, the set of N 2 lines {(j, k) y j = 1 ,..., AT, k = 1 , TV} 
is gathered in groups in such a way that non-rendered fines with the minimum value of index k are taken 
first. For example, consider the subdomain with 14 3 nodes where the 14 2 = 196 fines are gathered in the 17 
groups (see Fig 2a). Each group contains the 12 lines except the last, 17 th group. To execute the Thomas 
algorithm in the x direction, lines are gathered in groups by this method that secures a contiguous extent 
in the y direction. 

The schedule of processors corresponding to this case is shown in Fig. 3. The first two processors 
have idle time (denoted as 0) between the forward and the backward steps and the first processor does not 
complete its tasks first. The reason is that there are no completed lines while these processors become idle. 
The straightforward way to remedy the problem of idle processors is to reduce the number of lines per group. 
However, the communication latency time increases as more messages have to be transfered. 

Consider the case where processors are used for the Thomas algorithm computations in the next direction 
while they are idle repeatedly for the x and y directions. In addition to the previous case, executing the 
Thomas algorithm computations in the y direction, the algorithm gathers fines so as to form a contiguous 
extend of nodes in the 2 direction. Therefore, the set of N 2 lines {(i, &), i = 1, ..., AT, k = 1, N] is gathered 
in groups in such a way that non-rendered fines with the minimum value of index i are taken first. 

Consider the previous example (Fig 2a-b). The first 8 groups of fines must be completed in the current 
direction before processors become idle. Otherwise, processor^ stay idle when they perform computations 
in the y direction waiting for a contiguous extend of grid nodes in the 2 direction. This causes a severe 
restriction on the maximum number of lines per group. 

Thus, scheduling of data-dependent computations while pr< icessors are idle from the Thomas algorithm 
may lead to restrictions of the number of lines per group. By scheduling data- independent computations 
while processors are idle we avoid these restrictions. In the considered case of FS, computations of the S y 
and S z operators while processors are idle from the Thomas algorithm in the x and y directions are data- 
independent. Both the IB-PTA and the PTA with the processor scheduling may be adopted for the two first 
stages of the FS. Still, the IB-PTA is essential for the last stage of the FS. 

The following recommendations with regard to the processor scheduling are drawn for non-linear FS 
methods: 

• If computational time per grid node is greater for the local c imputations than that for the forward step 
computations, the spatial operators are computed for a subset of grid nodes while processors are idle' from 
the Thomas algorithm. Coefficients for the rest of nodes are computed after the completion of the Thomas 
algorithm for all fines in the current direction. These computations are local; therefore, they do not contribute 
to the parallelization penalty time. 

• If local computations are partly data- independent (computatic ns of the l.h.s. coefficients) and partly data- 
dependent (computations of the r.h.s.), then the data-indepen< ent computations are scheduled to execute 
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first while processors are idle. 

• If data-independent computation time per grid node is less than the forward step computational time per 
grid node and greater than a half of the latter time, then the computations of the S y and S z are scheduled 
while processors are idle in the x direction. The Thomas algorithm computations in the z direction are 
scheduled while processors are idle in the y direction. 

For the systems of the Euler or Navier-Stokes equations these suggestions can be applied as follows. 
A typical large-scale aerodynamics code ARC-3D [4], [11] includes three dimensional solution of a system 
of five PDE and uses two versions of ADI. The first version is the original Beam- Warming Approximate 
Factorization scheme [12] leading to the block tri-diagonal Thomas algorithm operating with 5x5 blocks. 
The other version is based on the diagonalization technique of Pulliam and Chaussee [13], leading to the 
decoupling of variables and solution of 5 pent a- diagonal scalar systems in each direction. Naik et al measured 
the elapsed CPU time and number of floating point operations for these two versions ([4], Table 2). Presented 
there are the following results important for our study: (i) the cost of implicit part of the solver (including 
the setup of coefficients of the linear system) is the dominant cost for the two versions; (ii) the cost of the 
r.h.s. computations is approximately equal to the total cost of forward step computations in all directions for 
the second version; (iii) the cost of coefficients setup is greater than the cost of the forward step computations 
for the second version. 

For the scalar penta-diagonal version, the setup of coefficients costs much due to non-linearity of the 
original system of equations, use of fourth order numerical viscosity in implicit side, and multiplications of 
intermediate ADI functions by curvilinear derivatives. For the block tri-diagonal version, the forward step 
computations cost approximately ten times as much as those for the scalar penta-diagonal version. 

For the scalar penta-diagonal version, one may use processors for local computations while they are idle 
from the Thomas algorithm. These local computations include data-independent computations of discretized 
coefficients and data-dependent multiplications of intermediate FS functions by curvilinear derivatives. For 
the block tri-diagonal version, the only way to use the processor idle time is to execute the Thomas algorithm 
in the next direction. For the third stage of the ADI, the data-dependent r.h.s. computations may be 
performed. 

3. Theoretical model of parallelization efficiency. A parallel machine model called the multi- 
computer [7] is used here for the development of the model. A multicomputer comprises a number of von 
Neumann computers, or nodes, linked by an interconnection network. Each computer executes its own 
program. This program may access local memory and may send and receive messages over the network. 
Messages are used to communicate with other computers or, equivalently, to read and write remote memo- 
ries. In the idealized network, the cost of sending a message between two nodes is independent of both node 
location and other network traffic but does depend on message length. Although the most important case 
for parallel computing is 3-D, we start with the 2-D case and further use the same technique to build the 
theoretical model for the 3-D case. 

3.1. 2-D case. The parallelization efficiency is estimated for a square computational domain covered 
with Nd x Nd equal load-balanced subdomains with NxN grid nodes per subdomain Each subdomain belongs 
to a different processor. A single PDE to be solved and the FS method (2) with tri-diagonal matrices in each 
direction are assumed. Extensions to a 3-D case and global computational domain with an unequal number 
of grid nodes in each direction will be considered in the next subsections. 

The communication time for a single message between two processors in the network can be approximated 


7 



by the following linear expression [9], [10]: 

(?) f(L) = b 0 + b x L, 


where 6 0 and b i are communication coefficients and L is the length of the string in words. 

In this section, the additional (penalty) time required for a single FS time step due to communication 
and idle time of processors is estimated. This penalty time is defined as 

( 8 ) F = T. parallel ~ T 3erial /N D , 

where T poro Uei is the actual elapsed time per processor on a MIMD computer, and T seria i is the actual 
elapsed time on a single processor. The function F is composed of three main contributions: 

Fi - the communication time due to the transfer of the forward step coefficients and the backward step 
solution of the Thomas algorithm. 

F 2 - the idle time due to waiting for communication with the neighboring processor. 

F$ - the communication time due to the transfer of the values of the FS variables between neighboring 
subdomains. 

For the square subdomains considered with TV x TV grid nodes, the quantities F\ - F 3 are given by 

(9) Fi = L S (\N/Kt](b 0 + 2b\Ki) + \N/K 2 ]{b Q + b x K 2 )), 

where \N/Ki \ and [ N / K 2 } are the number of messages for the forward and backward steps, respectively, 
L s is the index of the partitioning scheme (2 for 2-D and 3 for 3-D), K x and K 2 are the number of lines per 
message for the forward and the backward step computations. 

The expression of F\ is the same for the proposed and the basic algorithm as the same data must be 
transfered. However, the optimal values of K\ and K 2 are different for these algorithms (see below). 

There are two reasons why the current processor has to wait for its neighbors: 

1. Neighboring processors are synchronized due to the exchange of values of FS variables: the (/, J) th 
processor completes its backward step computations for the last K 2 lines later than the (/, J + l)* /l processor. 
At the next time step, these processors must exchange interface values of the intermediate FS function. 
Therefore the (/, J + l)*' 1 processor has to wait for the (/, J) th processor. 

2. If data- dependent computational tasks are scheduled while processors become idle, processors might 
be idle waiting for the completion of the first group of lines (see Fig. 3). This idle time is equal to the time 
difference between the completion of the first portion of lines l y the backward step and the completion of 
all lines by the forward step. 

The delay time of the current processor is determined as the maximum of these two delays: 

(10) F 2A = L s ■ max(TVAT 2 p 2 , ( 2 {N d - 1) + \p])NK l9l - 1 V 2 9l ). 

First, let us consider that the first reason of delay dominates and thus 

(11) f 2A = L s NK 2 g 2 (= L 8 NK l9l ). 

For the basic algorithm the global synchronization occurs twice per spatial step due to a pipelining 
property of the Thomas algorithm: 

(12) F 2B = L s (N d - l)N(K l9l d K 292 ). 
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A processor sends the interface values of the intermediate FS functions to the neighboring processors at 
each spatial step. Thus, each processor sends 2 L s messages with length N per spatial step. The communi- 
cation time for transfer of the variables between neighboring processors is 

(13) F s = 2L s (b 0 + b 1 N). 

The term F 3 does not depend on K. 

Generally, computational times per grid point are different for the forward step and for the backward 
step computations of the Thomas algorithm. For the IB-PTA the computational work per group of lines 
should be the same for the forward and the backward step computations: 

(14) NK l9l = NK 2 g 2 , 

where g\ and g 2 are the computation times of the forward and backward steps of the Thomas algorithm per 
grid point. Thus, 

(15) K 2 =\^K 1 . 

92 

The penalty function F depends on the following parameters: the number of grid points in one direction 
per subdomain TV, the number of subdomains No, the computation times the Thomas algorithm per grid 
point g i and g 2 and the communication coefficients bo and b\. This function for the proposed algorithm is 
given by Fa = F\ + F 2 a + F 3 and for the basic algorithm Fb = F\ + F 2 b + F$ . The theoretical value of 
parallelization penalty function per grid point is defined as 

(16) PnM = -~,; L x 100 %. 

9ov N a 

The way to seek such values of K 2 that minimize F is to solve the equation 

(17) dF/3K 2 = 0 , 

where 1 < k 2 < N. In order to facilitate this operation the discrete function [x] is replaced by x in the 
following discussion. For the proposed algorithm using the IB-PTA the optimal K 2 value is given by 

(18) K\,ib-pta = \/(l + P)ll Pi K 2 JB-PTA — VI i + pb» 

where 7 = bo/g 2 is the ratio between the communication latency and the characteristic computational time 
per grid node and p = g\/g 2 is the ratio between the forward and the backward step computational times. 
The corresponding value of the parallelization penalty Fa is 

(19) F a = Ng 2 (4 y/y(l -\- p) + 6r -F 4 ( 7 /N V- r)). 

For the basic algorithm the optimal K\ and K 2 values are given by 

(20) ° = \lmPry K «- 

and the corresponding value of the parallelization penalty function: 

(21) F b = N g 2 (4y/y{Nd - 1)(1 + yfp) -f 6 r -f ^isi/N -f r)). 
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The ratio of first components of the F B and the F A which are the leading terms for large 7 and N is 
given by 

( 22 ) r x = ~ *0 + yfl 

y/TTp 

The multiplier Cy = (1 + v /p)/ v /lTp reaches its maximum y/2 if the computational times per grid 
node are equal for the forward and the backward steps ( p = 1). For the bound case g x >> g 2 (p — > oo), the 
multiplier C\ — ► 1. 

The ratio of parallelization penalties for the large N is expressed by 

(23) R 2 = F pTA _ F\ + F 2iPTA -H F 3 ~ Ri + C 2 

Fib~pta Fi -f F 2 jb~pta 4- F 3 \+C 2 ' 

where C 2 = 2.5r/ v / 7(l + p) < 2.5r/ v /2 7. 

Thus, the ratios Ri and R 2 are 0(N ] /2 ) with the factor 1 < C\ < \[2. 

Now we have to verify when the first term in Eq. (10) dominates, i.e., 

(24) (2 (N d - 1) + \p])NK l9l - N 2 gi < NK 2 g 2 . 

Using Eq. (14), the above inequality becomes 

(25) K\ < Fi ir , F 2 < 

where = N/{2(N d - 1) -F |>1 - 1), K 2yT = pN r /(2(N d - 1) f [p] - 1). By substituting Eq. (18) for K 2A 
in the above inequality, we obtain a condition when the first term in Eq. (10) dominates: 


f ( l + ph < 


2(jv rf -i)+rpi - 1 ’ 


If this inequality is satisfied, Eq. (19) and ratios (22, 23) are valid. The critical value of N depends linearly 
upon Nd : 

( 27 ) N cr = (2(N d -l)+ \p \ ) v/OTph/P- 

Assuming that the second term in Eq. (10) dominates, we obtain the optimal value of K from the 
condition dF/dK = 0 to give 


K*a = 


I (l+p)7 
2(Afc-l)+ Pi' 


For the case considered, the condition K' 2A > K 2 , T is expressed by 

(29) / (1 + p)7 pN_ 

V 2(N d - 1) + \p] - 2(N d - 1) + [>] ' 

Therefore, the expression for N' CT is given by 

( 30 ) Kr = \/2 Wd - 1) + [pT yfil+ph/p = Kr/ ^2(N d -l)+\p}. 

Thus, there are three cases in terms of the optimal number of lines solved per message. For N > N cr or 
if data-independent computations are scheduled while processors are idle, the optimal K values are defined 
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by (18). For N f cr < N < N cr , the optimal K values are equal to K r (see (25)). The values K\ %r and K 2 , r 
correspond to the case when the backward step computations for the first portion of lines are completed 
immediately after conclusion of all the forward step computations on the first outermost processor. However, 
for N' cr > N these K r do not give the minimum to the penalty function. In this case the optimal K are 
defined by Eq. (28) , and processors become idle from the conclusion of the last group of lines by the forward 
step till the beginning of the first group of lines by the backward step. Finally, the value of parallelization 
penalty is given by 


N<72(4\/7(1 + p) + 6t + 4(7 /N + r)) 


(31) F A = { 


n 92 ( 


2(i+ P ) 7 (2(^ d -i)+r P ]-i) 


+ + 6r + 4 ('l / N + T )) 


pN ' 2(Nj-l)+ rpl-l 

Ng 2 {2yf{ 1 + ~py^ 2 [Wd - 1) + W - 1) - Np + Gr + 4(7 /N + t)) 


The asymptotic analysis of the above formulae and Eq. (27 and (30) for 
following expression for the asymptotic order of the parallelization penalty: 


if N > N cr 
if N' cr <N< N cr 
if N < N' cr . 

N cr and N' cr leads to the 


(32) 


0(F A ) = { 


O(N) 

0(N d ) 

0{N)0{N l d /2 ) 


if O(N) > 0{N d ) 
if 0(iV d 1/2 ) < 0{N ) < 0(N d ) 
if O(N) < 0{N l J 2 ) . 


Thus, for Nd, N — ► oo the proposed algorithm has an advantage over the basic algorithm unless O(N) < 
0(Ny 2 ), In the last case both algorithms have the same order 0(A r )0(7Vj^ 2 ) (see Eq. 21)). The order of the 
penalty function in terms of the overall number of nodes and the overall number of subdomains (processors) 
is given by 


(33) 


O(Fa) = { 


0(N^ t 2 )/0(N 1 J 2 ) 

0(N l J 2 ) 

0(N% t 2 )/0(N^) 


if 0(N tot ) > O(Nl) 
if 0(Np 2 ) < 0(N tot ) < 0(N 2 D ) 
if 0(N t ot) < 0(Np 2 ) , 


where N tot = (A^ x Nj ) 2 is the overall number of nodes and No = N$ is the overall number of processors. 

The previous case corresponds to the data-dependent computations scheduled while processors are idle. 
Now we will analyze a case where part of computations scheduled to be executed while processors are idle is 
data- independent. In this case the second term in Eq. (10) becomes 


(34) 


(2(N d - 1) + \p])NK l9l - (1 + a)N 2 gi, 


where a = gdi/g l is the ratio of the data-independent computational time g^i to the forward step com- 
putational time per grid node. The data-dependent computations are scheduled to be executed first while 
processors are idle from the Thomas algorithm computations. The expression for K\, r is given by 


(35) 


N{ 1 + a) 

2(JV d -l)+ M “I' 


The expressions for the critical numbers are divided by (1 + a) in Eq. (27,30). Asymptotic results (32) 
are the same as for the previous case of the data-dependent computations scheduled while processors are 
idle. However, the parallelization penalty is reduced: 


F a 


; 2 (4\AK1 + P ) + 6r + 4(7 /N + t)) 

; 2 (2\/(l + p)7(2(A^ d - 1) + fp] - 1) - N( 1 + a)p + 6r + A(^/N + r)) 


if TV > N ct 
if iV' r < N < N cr 
if N < N^ r . 
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3.2. 3-D case. In this case a cubic domain is divided regularly into cubic subdomains with N x N x N 
grid nodes each one. The parallelization penalty function is composed of the same components as in the 
previous 2-D case: 

(36) Ft = L a (\N 2 / K{\(bo + 2biK\) + \N' J /K 2 ]{b 0 + h K 2 )), 


(37) F 2A = L s max{NK 2 g 2 , (2{N d - 1) + [pDJVATjp! - N 3 gi ), 


(38) F 3 = L a (b 0 4- biN 2 }. 

The number of grid nodes at an interface boundary is equal to N 2 and not to N as for the 2-D case; therefore, 
N is replaced by N 2 in the components of the parallelization penalty. 

The parallelization penalty component F 2B is the same as in the 2-D case, and the parallelization penalty 
for the basic algorithm is 


(39) F b = Ng 2 (6v / 'yN(N d - 1)(1 + y'p) + 9-N + 6( 7 /jV + tN)), 

where the optimal K values are 


(40) 


Ki, b = 


1V 7 


p(N d -iy 


K 2 ,b 




N'y 

JV d - 1’ 


If the first term in Eq. (37) dominates, the optimal K values for the proposed algorithm are analogous to 
(18) and are given by 


(41) K\,a = \/N{l + p) 7 /p, K 2 ,a = /fV(l + p) 7 . 

The bound values K\ <r and K 2 , r are 


( 42 ) Ki,t = N 2 /(2(N d - 1) + \p]),K 2<r = pN 2 /(2(N d - 1) + [pD- 


Using expressions (41, 42), the critical number of nodes per sut domain in one direction is obtained: 


(43) N cr 

The values of K 2A and N' cr are 


(2 {Nd ~ 1) + [p) - 1) 2 (1 + p) 7 ) 


-, 1/3 


(44) 


The final expression for the parallelization penalty of the propo ted algorithm in the 3-D case is 


(45 )F a = { 


Ng 2 (6 v/W-Kl + p) + 9Nt + 6(j/N + Nt)) 

iVg 2 ( 3 ( 1 Vj7(2(N d -i) +M ) + + 9 Nr + 6( 7 /JV + Nr)) 

Ng 2 {3y/N(i + p) 7 (2(fV d - 1) + (pl) - N 2 p + 9 Nt -(- 6( 7 /AT + Nt)) 


if N > N cr 
if N' cr < N < N cr 
if N < N' cr . 


The asymptotic analysis follows a pattern very similar to that of the previous subsection: 


(46) 


0(F a ) = 


0(N 2 ) if O(AT) > 9(JV 2/3 ) 

0(N 2 ) if 0{N y/2 ) < O(N) < 0(N 2/3 ) 

0{N d ) if 0{N l J 3 ) < O(N) < 0(N y/2 ) 

0(JV 3 / 2 )0(7Vj /2 ) if O(N) < 9(7Vj /3 ) . 
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If N falls into N f cr < N < N cr (the second line in (45)) then the following cases are considered: 0{N^ 2 ) < 
O(N) < 0(N d ^ 3 ) and 0(N^ Z ) < 0(N) < 0(N^ 2 ). In the former case the leading term is the first 
component whereas in the latter one the leading term is the Nr component. 

The asymptotic analysis of the parallelization penalty for the basic algorithm (39) leads to the following 
expression 


(47) 


0(F b ) = 


0(N 2 ) if O(N) > 0(N d ) 

0(N^ 2 )0(N l d /2 ) if O(N) < 0{N d ) . 


Thus, the proposed algorithm has an advantage over the basic one if 0(N d ^) < O(N) < 0(N d ). Oth- 
erwise, both algorithms are of the same order. If O(N) > 0(N d ), the main component in the parallelization 
penalty becomes 15JVY. This component characterizes the amount of transfered data which is the same for 
these algorithms. If O(N) < (^(TV^ 3 ), the idle time between conclusion of the forward step computations 
and completion of the backward step computations for the first group of lines becomes large and there is no 
longer an advantage of the proposed algorithm over the basic one. 

The order of the parallelization penalty function in terms of the overall number of nodes and the overall 
number of subdomains (processors) is given by 


(48) 


o(f a )= | 


0(N?' t 3 )/0(N 2 D /3 ) 

0{N]> 3 ) 

0(Nli 2 )/0(N 2 D /3 ) 


if 0(N tot ) > 0(N 3 d /2 ) 
if 0(N 4 d /3 ) < 0(Ntat) < 0(N 3 d /2 ), 
if O(Ntot) < 0(Np 3 ) , 


where N to t = (N x N d ) 3 is the total number of grid nodes and N D = N$ is the total number of processors. 
The proposed algorithm has an advantage over the basic algorithm if 0(Np 3 )<O(N tot )<0(N*). 

3.3. 3-D domain with different number of grid points in each direction. The case with different 
number of grid nodes in the x y y and z directions is considered here ( N x ^ N y ^ N z ). We will obtain 
theoretically (where it is possible) the optimal cover of the computational domain with subdomains. 

Consider first the case where the first term in Eq. (10) dominates; therefore, the parallelization penalty 
for the proposed algorithm is given by 


(49) Fa 



(bo 4 - 2biKi } i) 4 


NjNk 

Ki, 2 


(bo 4 b\Ki 4 NiKi,292 4 2 ( 6 q 4 b\Nj 



where z, k are the spatial directions (z ^ j ^ k), K{, Kj and K \ t are the numbers of fines solved 
per message in spatial directions, (in general, K t ^ Kj ^ Kk). The first two terms in the above equation 
represent the communication time due to the transfer of the forward step coefficients and the backward step 
solution. Next term corresponds to the processor idle time due to the local synchronization. The last term 
is equal to the communication time due to the transfer of the FS variables. We group the terms, use Eq. 
(14) and replace [x] by x, which leads to the following expression: 

(50) Fa = (&o(l4 P)~^ — -4^/^202^ 4 561 NjN^ 4 66 0 . 

Both sums in the above equation will be estimated by the following known inequality: 

n n 

(51) 

1=1 1=1 
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where £" =1 a * = n (Iir=i a i) 1/n for equal a t . 

The estimation for the second sum in (50) is given by 

(52) Y, ( N i N k) > 3 (NiNjN *) 3 ' 3 = 3 N% 3 , 

i— 1,2,3 

where N ov = N t NjN k is the overall number of grid nodes per subdomain. This sum is minimal if N t = Nj = 
N k = N, i.e., if the subdomains are cubic. The first sum is estimated by 

(53) Y (! + P)^NjN k (Ki,2 + NiKiw) > 6(6092(1 + p)N t N J N k ) 1 ' 2 = 6(6 0 9 2 (1 + p)N ov ) l > 2 . 

*=1,2,3 


This inequality turns into equality if Nj = Nj = N k = N and K 2 satisfies Eq. (41). Thus, the partitioning 
by cubic subdomains gives the minimum parallelization penalty. In this case, the optimal number of solved 
lines per message is the same in all directions, and the parallelization penalty is equal to the value stated in 
the first line of (45 } 

The parallelizai penalty for the basic method is 


f b= Y 

•=1,2,3 


NjNk 

K it , 


(bo + 2biKi t i) + 


NjNk ' 

K ia 


(b 0 + bxKit) + (N dti - 1 )Ni(Ki, igi + K ia g 2 ) + 2(6 0 + 6 iNjN k ), 


where N d ,i is the number of subdomain partitioning in the i th direction. We can show (see above), that the 
F b reaches its minimum if N, = Nj = N k . In this case, optimal K values are given by 


(54) 


A',.! = 


I 7 IV 

p(N d ,i-iy 


Ni,: 


jN 

V (N d , t -iy 


Thus, for the basic algorithm the optimal cover is also composed of cubic subdomains; however, the optimal 
K values may be different for different directions. Therefore, foj both algorithms the number of subdomains 
in each direction is proportional to the number of grid nodes in this direction, i.e., Nd, x : Nd y : z = 

^toti • Ntoty * NfotZ' 

For the basic algorithm the penalty function for optimal K values is given by (39), where 6(N d - 1) is 
replaced by 2 2 3 y/Nd^i — 1. This sum can be estimated by 


( 55 ) Y, ^ Nd -' - 1 ~ Y > zK /6 . 

i=l,2,3 i=x,y,z 

Thus, for the basic algorithm the parallelization penalty reaches its minimum for a cubic global domain. For 
the proposed algorithm, the parallelization penalty is invariant with respect to the n um ber of nodes in each 
direction (see the first line in Eq (45)). This represents an additional advantage of the proposed algorithm 
over the basic one. 

If the values of K are equal to bound values K T (42), then the cubic subdomains no longer give the 
minimum parallelization penalty for the proposed algorithm. H >wever, partitioning by cubic subdomains is 
chosen in this case as a fair assumption. 


4. Numerical solution of the sample problem. The parallelization method and the model for the 
parallelization efficiency are tested by a numerical solution of a benchmark problem. The non-stationary 
Laplace equation in the cube ft = (-1 < x < 1] x [-1 < y < 1| x [-1 < z < 1] is chosen as the test case. 
The PDE to be solved is 


(56) 


du d2u d2u , d2 u 

dt ~ ax dx 2 +ay !h? +Cz "dz* ' 
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Table 1 

Characteristic parameters of MIMD computers 



parameters of communication 


non-dimensional parameters 

Computer 

bo, pS 

fci, fiS/Word 

92 , pS 

p = 91/92 

7 = b 0 /g 2 

II 

0- 

K 5 

CRAY T3E 

18 

0.1 

0.36 

1.71 

50. 

0.28 

IBM SP 

70 

0.05 

0.28 

1.42 

250. 

0.18 


where a x , a y and a z are non-linear functions of U. There are the following boundary and initial conditions: 

(57) V ( x,y,z)eCt U(x,y,z, 0) = 0, 

V (x, y, z) e dSl U (x, y,z,t) = 1. 

The MIMD computers used in this study are installed in the San Diego Supercomputer Center (SDSC) 
at the University of California, San Diego [14]. SDSC’s CRAY T3E has 272 (maximum 128 for a single task) 
distributed- memory processing elements (PEs), each with 128 megabytes (16 megawords) of memory. Each 
processor is a DEC Alpha 21164 (300 MHz clock). The T3Es run the UNICOS/mk operating system. The 
T3E PEs are relatively inexpensive, with fast processing ability but slow main memory. The theoretical limit 
of 600 Mflops for the 300 Mhz processors of the SDSC T3E applies only to certain operations within the 
registers of the processor. The IBM SP has 128 (maximum 120 for a single task) thin node POWER2 Super 
Chip (P2SC) processors with 256 MBytes of memory on each processor. The SP processors are superscalar 
(implying simultaneous execution of multiple instructions) pipelined chips and are capable of executing up 
to six instructions per clock cycle and four floating point operations per cycle. These nodes run at 160 Mhz 
and are capable of a peak performance of 640 MFLOPS each. 

Our measurements of communication times (bo and bi) for the CRAY T3E and the IBM SP computers 
are presented in Table 1. The maximum length of string is 1000 words. The sample size is 100 messages 
for each string. Communication time is the time required to receive a message of L words which sent from 
another processor. These measurements confirm the linear approximation (7) for the communication time. 

Computational times are obtained from computational experiments on a single processor. Fortran com- 
piler CF90 with the third optimization level is used on the CRAY T3E. The Fortran compiler on the SP is 
IBM’s XL Fortran, also known as xlf. The cash locality is exploited in our computer code, i.e., the ’’implicit” 
direction (index i in Eqs. (4,5)) corresponds to the last index in working arrays for the Thomas algorithm 
computations in all three directions. The values of p are different for CRAY T3E and IBM SP computers 
as ratios between arithmetic operations are compiler- and computer- dependent. 


4.1. Description of a multi-processor code. The code is designed as follows. First, the optimal 
number of grid lines to be solved per message (i.e., the optimal number of groups of lines) is computed as 
described in the previous section. Then, processors are scheduled in each direction according to the algorithm 
described in [6]. These one-directional schedules and schedules of the local computations are joined together 
to use the processor idle time according to recommendations presented in the second section. Thus, the 
static schedule of processors is formed. 

The solver part of the code (Appendix A) does not depend upon a particular schedule of processors. 
The functions handling communications are referenced in a common form without relation to any specific 
message-passing system. The external loop with the loop variable IPX execute lines group- by- group. The 
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array I COM governs communication with six closest neighbors of the current processor. The value of 
ICOM(IPX, I) controls type of communications, as follows: 


(58) 


/ 


ICOM(IPXJ) = { 


0 

1 

2 

3 


processors ( o not communicate 

send 

receive 

simultaneous send and receive 


where 1=1,.. .,6. The send and receive operations transfer either the forward step computations or the 
backward step solution. Before the computations for the current group of lines are executed, each processor 
has completed exchanging data with its neighbors. The only data used for the computations at the current 
time unit are transfered to the processor. At the backward step of the Thomas algorithm computed values 
of the solution are transfered to the processor ahead after they have been computed in the current processor 
at the previous time unit. These values are stored in the array SF. At the forward step interface values 
of the coefficients are not transfered immediately after they have been computed; therefore, these values 
are extracted from matrices of the forward step coefficients DX and GX where they are stored. Pointers 
J 3X, J3Y and J3Z control these data streams in the directions X, y and 2 , respectively. 

After communications have been completed, computations are executed according to the value stored in 
the array, ITX (see (6)). The local, forward and backward step computations are executed in sub-routines 
COEF, FRWD and BCWD, respectively. 


4,2, Results of multiprocessor runs. Results of multiprocessor computer runs are shown in Tables 
2a, b for CRAY T3E and IBM SP computers, respectively. The number of grid nodes per subdomain varies 
from 10 3 to 20 3 for the CRAY T3E and from 12 3 to 20 3 for the SP computer. The computational domain 
is covered with 3 3 ,4 3 or 5 3 equal cubic subdomains. Therefore, the total number of grid nodes varies from 
27 x 10 3 (46.5 x 10 3 for the SP computer) to 10 6 . 

Processors compute coefficients a y and a z while they are idle from the Thomas algorithm in the x and 
y directions. These computations are data- independent. In this particular case a = 1.2 for the CRAY T3E 
and a = 1.3 for the SP computer (see Eq. (34)). The data-dependent computations of the coefficients a x 
are executed while processors are idle in the 2 direction. The optimal K values in each direction for the 
proposed algorithm are computed by one of the Eqs. (41) (42), (44). These methods and denoted as 1,2 
and 3, respectively, in Tables 2a, b. For the basic algorithm, optimal values of K are computed by (40). 
The values of K for the proposed algorithm are greater than th Dse for the basic algorithm that confirms an 
advantage of the proposed algorithm in terms of the number of messages. 

The parallelization penalty (Pn) is obtained by 

(59) Pn = - P ~ Tl x 100%, 

T\ 

where 7> and T\ are measured computational times on P pr< cessors and on a single processor with the 
same size of subdomain, respectively. The parallelization penalty obtained from computer experiments is in 
good agreement with that computed by the theoretical model (Id). The difference between experimental and 
theoretical values increases with the decrease in the number of gri 4 nodes per processor. This can be explained 
by those details of the computer architecture that are not taken into account in the developed theoretical 
parallelization model. To match theoretical and experimental results for the SP computer, experimental 
data are filtered. If the elapsed time for a time step is 30% gre iter than the averaged value, this time step 
is excluded from the sample. These time steps make 3 - 5% of overall sample. The parallelization penalty 
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for the basic algorithm is approximately twice as much that for the proposed algorithm for the considered 
range of the number of nodes per processor and the number of processors. 

To estimate the parallelization penalty for the large number of processors which are not available yet we 
compute the theoretical parallelization penalty for 10 3 processors and present these results in Tables 2a,b. 
The ratio of parallelization penalties of the basic and the proposed algorithm increases with the number of 
processors. 

Finally, the computer runs are executed for an unequal number of grid nodes in various directions. The 
partitionings are 8x4x2 and 16 x 2 x 2, the total number of processors is equal to 64. The parallelization 
penalty for the basic algorithm is greater for the considered cases than for the 4 3 partitioning; however, for 
the proposed algorithm parallelization penalty is almost equal to that for the 4 3 partitioning. This confirms 
the theoretical conclusions of subsection 3.3. 

5. Conclusions. A parallel implicit numerical algorithm for the solution of directionally split 3-D 
problems is proposed. This algorithm provides exactly the same solution as its serial analogue at each time 
step. While executing this algorithm, processors run in a time-staggered way without global synchronization 
in each direction. The proposed algorithm uses the idle processor time either for computations of discretized 
coefficients of the PDE to be solved or for the Thomas algorithm computations in the next spatial direction. 
To make the algorithm feasible, the reformulated version of the pipelined Thomas algorithm is used. 

Static scheduling of processors is adopted in this study. Various computational tasks which may be 
executed while processors are idle from the Thomas algorithm in the current direction are discussed and 
recommendations about optimal scheduling of processors are drawn. 

A theoretical model of the parallelization efficiency is developed. This model is used to estimate the 
parallelization penalty for the basic and the proposed algorithms. The optimal number of lines to be solved 
per message is defined by this model. The asymptotic analysis shows the relations between the number of 
grid nodes per subdomain and the number of processors which ensure an advantage of the proposed algorithm 
over the basic one. Finally, this model is used for the optimal partitioning of a computational domain with 
an unequal number of grid nodes in spatial directions. 

The parallel computer code uses the modular design technique: a schedule of the processor tasks is 
assigned before computations by the numerical algorithm and communications are separated from compu- 
tational modules. Experiments with the multidomain code in distributed memory multiprocessor systems 
(CRAY T3E and IBM SP) show a reasonable parallelization penalty for a wide range of the number of 
grid nodes and the number of processors. The parallelization penalty agrees well with that obtained by the 
theoretical model. 

6. Acknowledgment. The author wishes to thank Professor David Keyes (ICASE and ODU) for useful 
discussion about parallel numerical algorithms. 
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Appendix A 

Fragment of computer code which performs the solver of the proposed algorithm 


c 

c loop by portions of lines solved per one message 

c 

DO IPX=1 , IEND 
c 

c ********** logistics of communications *************************** 
c 

IC0MT=IC0M(IPX , I) 

c a. receive from neighboring processors (only one receive is expected) 
DO 1=1,6 

IF(ICOMT.GE. 2) go to 62 
END DO 
go to 64 
62 continue 

c receive forward step coefficients from previous subdomain 
IF (I.LE.3) THEN 

IF (ICOMT.EQ . 2) THEN 

call rec v (RF,2*K1 ,nbr (I) , itagf (I) ) 

ELSE 

call sendrecv(SF,K2,nbr (I) ,itagb(I) ,RF,2*K1 ,nbr (I) , itagf (I)) 

END IF 

go to 64 
END IF 

c receive backward step solution from subdomain ahead 
IF(I . GE.4) THEN 
11=1-3 

IF (ICOMT.EQ. 2) THEN 

call recv(RF,K2,nbr (I) , itagb(Il) ) 

ELSE 

IF(I.EQ.4) call STORE (SF,DX,GX, J3X) 

IF (I .EQ.5) call STORE (SF,DY, GY, J3Y) 

IFCI.EQ.6) call STORE ( SF, DZ, GZ , J3Z) 

call sendrecv(SF,2*Kl ,nbr (I) , itagf (II) ,RF ,K2,nbr (I) ,itagb(Il)) 
END IF 
END IF 

64 continue 

c b. send to neighboring domains 
DO 1=1,6 

IF (IC0M(IPX, I) .EQ . 1) then 
IF (I.GE.4) THEN 

c send forward step coefficients 
11 = 1-1 

IF(I .EQ .4) call STORE (SF,DX,GX, J3X) 

IF (I .EQ.5) call STORE ( SF , DY , GY , J3 Y) 
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IF(I.Eq.6) call STORE (SF , DZ , GZ , J3Z ) 

call send(SF,Kl,nbr(I),itagf (ID) 

END IF 

c send backward step solution 

IFCI.LE.3) call send(SF,K2,nbr(I) ,itagb(I)) 
END DO 
c 

c *********** logistics of computations 
c 

IF (ITX(IPX) .EQ.4) THEN 
c computations of coefficients 

IF(IDIR.EQ.l) call COEF(AX,BX,CX) 
IFCIDIR.EQ.2) call COEF(AY,BY,CY) 
IF(IDIR.EQ.3) call COEF(AZ,BZ,CZ) 

END IF 

c forward step computations 

IF (ITX(IPX).Eq.n call FRWD(DX.GX) 

IF (ITX(IPX) .Eq.2) call FRWD(DY.GY) 

IF (ITX(IPX) .Eq.3) call FRWD(DZ,GZ) 
c backward step computations 

IF (ITX(IPX) .Eq.-l) call BCWD(VX,DX,GX) 

IF (ITX(IPX).Eq.-2) call BCWDCVX ,DY,GY) 

IF (ITX(IPX).Eq.-3) call BCWD(VX,DZ,GZ) 

END DO 
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3 


b 


3 



Fig. 1. The processor schedule: (a) The IB-PTA is used for the z direction, processors compute local coefficients of the 
non-linear PDE while they are idle from the IB-PTA; (b) the baste PTA algorithm is used, processors are idle between the 
forward and backward steps. iV d = 4, Lj — 9, L& = 6 
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Fig. 2. Gather of fines in groups, (a) The Thomas algorithm in t) e y direction is executed while processors are idle 
computing the Thomas algorithm in the x direction; ( b ) In addition, the Tf -omas algorithm in the z direction is executed while 
processors are idle computing the Thomas algorithm in the y direction. N umbers denote the group of lines to which this line 
belongs. 


22 



1 


1 1 

— > — > 

111 
— > — > — > 

1111 
— > — > — > — > 

11111 
— > — > — > — > — > 

111111 
— > — > — > — > — > 

111111 
— > — > — > — > 

11111-1 
— > — > — > < — > 

1111-11 
— > — > < — > 

0 1 1 -1 1-1 

< — > < — > 

0 0 -1 1 -1 1 

< — > < — > — > 

0 -1 1 -1 1 1 

< — <— > — > 

-1 2 -1 1 1 -1 

< > < — > 

2-1 2 1 -1 1 

< < — > 

-1 2 2-1 1 -1 
< < — > 

2 2-1 2-1 1 

< < — > > 

2 -1 2 -1 1 1 

< < 

-1 2-1 2 2-1 

< < — > 

2-1 2 2 -1 1 

< < 

-1 2 2 -1 2 -1 

< < 

2 2-1 2-1 2 

< < 

2-1 2 -1 2 2 

< < 

- 12-1222 

< 

2 -1 2 2 2 2 

< 

-1 2 2 2 2 2 

2 2 2 2 2 2 

2 -2 -2 2 2 2 

-2 -2 - 2-2 2 2 
-2 -2 -2 -2 -2 2 

Fig. 3. The processor schedule: processors compute the forward step of the Thomas algorithm in the y direction while 
they are idle from the Thomas algorithm computations in the x direction 
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Table 2 

Parallelization penalties for the benchmark problem, Ntot - total number of nodes, N - number of grid nodes per subdomain 
in a single direction, K x ,K y and K z - number of lines solved by backwarc step per message, Method - method of computation 
of K x ,Ky and K z values, respectively (1-by Eq. (41), 8 - by Eq. ( 42 ) and 3- by Eq. ( 44 )), Pn- penalty, obtained from 
computational experiments (Eq. (59)); PnM- theoretical value of penalty (Eq. (16)) 


a. CRAY T3E multiprocesso r system. 

Proposed algorithm Basic algorithm 


Ntot 

N 

K x 

Ky 

K z 

Method Pn, % PnM, % 

K x 

Ky 

K z 

Pn, % 

PnM, % 

27000 

10 

51 

51 

26 

partitioning 3x3x3 
112 52.41 39.90 

22 

22 

22 

108.35 

96.74 

91125 

15 

63 

63 

62 

112 23.80 19.97 

27 

27 

27 

53.60 

49.93 

216000 

20 

74 

74 

74 

111 15.01 13.56 

31 

31 

31 

33.86 

31.33 

64000 

10 

42 

42 

20 

partitioning 4x4x1 
222 58.71 43.94 

18 

18 

18 

132.4 

114.53 

216000 

15 

63 

63 

47 

112 31.10 21.34 

22 

22 

22 

59.23 

57.46 

512000 

20 

74 

74 

74 

111 15.19 13.56 

26 

26 

26 

39.46 

36.67 

125000 

10 

35 

35 

17 

partitioning 5x5x5 
222 67.29 54.5 4 

16 

16 

16 

152.20 

131.61 

421875 

15 

63 

63 

38 

112 33.84 22.69 

19 

19 

19 

70.17 

66.33 

1000000 

20 

74 

74 

67 

112 16.12 13.01 

22 

22 

22 

45.08 

41.59 


partitioning 10 x 10 x 10 

1000000 10 19 19 10 222 - 92.66 11 11 11 - 186.33 

3375000 15 40 40 19 222 - 29.84 13 13 13 - 93.51 

8000000 20 74 74 35 222 - 13.80 15 15 15 - 59.09 

partitioning 8x4x2 

64000 10 22 42 40 222 56.74 42.37 12 18 31 136.83 119.59 

216000 15 51 63 63 211 23.12 20.89 15 22 38 61.23 59.77 

512000 20 74 74 74 111 15.11 13.56 17 26 44 39.23 37.71 

partitioning 16x2x2 

64000 10 11 51 40 212 63.85 61.17 8 31 31 143.32 129.57 

216000 15 26 63 63 211 28.39 25.76 10 38 38 67.56 64.29 

512000 20 46 74 74 211 13.08 14.25 12 44 44 42.71 40.74 
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